    Quantifying the Resiliency of Fail-Operational Real-Time Networked Control Systems

    In time-sensitive, safety-critical systems that must be fail-operational, active replication is commonly used to mitigate transient faults that arise due to electromagnetic interference (EMI). However, designing an effective and well-performing active replication scheme is challenging, since replication conflicts with the size, weight, power, and cost constraints of embedded applications. To enable a systematic and rigorous exploration of the resulting tradeoffs, we present an analysis to quantify the resiliency of fail-operational networked control systems against EMI-induced memory corruption, host crashes, and retransmission delays. Since control systems are typically robust to a few failed iterations, e.g., one missed actuation does not crash an inverted pendulum, traditional solutions based on hard real-time assumptions are often too pessimistic. Our analysis reduces this pessimism by modeling a control system's inherent robustness as an (m,k)-firm specification. A case study with an active suspension workload indicates that the analytical bounds closely predict the failure rate estimates obtained through simulation, thereby enabling a meaningful design-space exploration, and also demonstrates the utility of the analysis in identifying non-trivial and non-obvious reliability tradeoffs.
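
    As a concrete illustration of the (m,k)-firm notion used above, the following minimal sketch (not from the paper; the function name and example trace are hypothetical) checks whether a trace of control-loop outcomes satisfies an (m,k)-firm specification, i.e., whether every window of k consecutive iterations contains at least m correct ones.

```python
def satisfies_m_k_firm(trace, m, k):
    """True iff every window of k consecutive iterations in `trace`
    (True = correct, False = failed) contains at least m correct ones."""
    return all(sum(trace[i:i + k]) >= m
               for i in range(len(trace) - k + 1))

# Example: a controller that tolerates one missed actuation in any
# three consecutive iterations, i.e., a (2,3)-firm specification.
trace = [True, True, False, True, True, False, True]
print(satisfies_m_k_firm(trace, m=2, k=3))  # True
```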

    From Iteration to System Failure: Characterizing the FITness of Periodic Weakly-Hard Systems

    Estimating metrics such as the Mean Time To Failure (MTTF) or its inverse, the Failures-In-Time (FIT), is a central problem in reliability estimation of safety-critical systems. To this end, prior work in the real-time and embedded systems community has focused on bounding the probability of failures in a single iteration of the control loop, resulting in, for example, the worst-case probability of a message transmission error due to electromagnetic interference, or an upper bound on the probability of a skipped or an incorrect actuation. However, periodic systems, which can be found at the core of most safety-critical real-time systems, are routinely designed to be robust to a single fault or to occasional failures (case in point, control applications are usually robust to a few skipped or misbehaving control loop iterations). Thus, obtaining long-run reliability metrics like MTTF and FIT from single-iteration estimates by calculating the time to first fault can be quite pessimistic. Instead, overall system failures for such systems are better characterized using multi-state models such as weakly-hard constraints. In this paper, we describe and empirically evaluate three orthogonal approaches, PMC, Mart, and SAp, for the sound estimation of a system's MTTF, starting from a periodic stochastic model characterizing the failure in a single iteration of a periodic system, and using weakly-hard constraints as a measure of system robustness. PMC and Mart are exact analyses based on Markov chain analysis and martingale theory, respectively, whereas SAp is a sound approximation based on numerical analysis. We evaluate these techniques empirically in terms of their accuracy and numerical precision, their expressiveness for different definitions of weakly-hard constraints, and their space and time complexities, which affect their scalability and applicability in different regions of the space of weakly-hard constraints.
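
    To make the Markov-chain route concrete, here is a hedged sketch in the spirit of the PMC approach (the paper's actual formulation may differ): treat the last k-1 iteration outcomes as the chain's state, make any violation of an (m,k)-firm constraint absorbing, and solve a linear system for the expected number of iterations until absorption, assuming i.i.d. per-iteration failures with probability p.

```python
import itertools
import numpy as np

def mttf_iterations(p, m, k):
    """Expected iterations until an (m,k)-firm constraint is first
    violated, given i.i.d. per-iteration failure probability p."""
    # Non-absorbing states: the last k-1 outcomes (1 = failure) that
    # have not already forced a violation in some window of k.
    histories = [h for h in itertools.product((0, 1), repeat=k - 1)
                 if sum(h) <= k - m]
    index = {h: i for i, h in enumerate(histories)}
    n = len(histories)
    Q = np.zeros((n, n))  # transitions among non-absorbing states
    for h, i in index.items():
        for outcome, prob in ((0, 1.0 - p), (1, p)):
            window = h + (outcome,)      # the last k outcomes
            if sum(window) > k - m:      # fewer than m correct ones:
                continue                 # absorbed (system failure)
            Q[i, index[window[1:]]] += prob
    # Expected steps to absorption t solves (I - Q) t = 1; start from
    # the all-correct history.
    t = np.linalg.solve(np.eye(n) - Q, np.ones(n))
    return t[index[(0,) * (k - 1)]]

# Example: per-iteration failure probability 1e-4, (2,3)-firm.
print(mttf_iterations(1e-4, m=2, k=3))  # iterations until violation
```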

    Towards “Ultra-Reliable” CPS: Reliability Analysis of Distributed Real-Time Systems

    In the avionics domain, “ultra-reliability” refers to the practice of ensuring quantifiably negligible residual failure rates in the presence of transient and permanent hardware faults. If autonomous Cyber-Physical Systems (CPS) in other domains, e.g., autonomous vehicles, drones, and industrial automation systems, are to permeate our everyday life in the not so distant future, then they also need to become ultra-reliable. However, the rigorous reliability engineering and analysis practices used in the avionics domain are expensive and time consuming, and cannot be transferred to most other CPS domains. The increasing adoption of faster and cheaper, but less reliable, Commercial Off-The-Shelf (COTS) hardware is also an impediment in this regard. Motivated by the goal of ultra-reliable CPS, this dissertation shows how to soundly analyze the reliability of COTS-based implementations of actively replicated Networked Control Systems (NCSs)—which are key building blocks of modern CPS—in the presence of transient hardware faults. When an NCS is deployed over field buses such as the Controller Area Network (CAN), transient faults are known to cause host crashes, network retransmissions, and incorrect computations. In addition, when an NCS is deployed over point-to-point networks such as Ethernet, even Byzantine errors (i.e., inconsistent broadcast transmissions) are possible. The analyses proposed in this dissertation account for NCS failures due to each of these error categories, and consider NCS failures in both time and value domains. The analyses are also provably free of reliability anomalies. Such anomalies are problematic because they can result in unsound failure rate estimates, which might lead us to believe that a system is safer than it actually is. Specifically, this dissertation makes four main contributions. (1) To reduce the failure rate of NCSs in the presence of Byzantine errors, we present a hard real-time design of a Byzantine Fault Tolerance (BFT) protocol for Ethernet-based systems. (2) We then propose a quantitative reliability analysis of the presented design in the presence of transient faults. (3) Next, we propose a similar analysis to upper-bound the failure probability of an actively replicated CAN-based NCS. (4) Finally, to upper-bound the long-term failure rate of the NCS more accurately, we propose analyses that take into account the temporal robustness properties of an NCS expressed as weakly-hard constraints. By design, our analyses can be applied in the context of full-system analyses. For instance, to certify a system consisting of multiple actively replicated NCSs deployed over a BFT atomic broadcast layer, the upper bounds on the failure rates of each NCS and the atomic broadcast layer can be composed using the sum-of-failure-rates model.
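
    The sum-of-failure-rates composition mentioned at the end is simple enough to state as arithmetic; the sketch below uses made-up FIT bounds purely for illustration.

```python
# Hypothetical per-component failure-rate upper bounds, in FIT
# (expected failures per 10^9 operating hours); the numbers here are
# assumptions for illustration only.
ncs_fit_bounds = [0.8, 1.2, 0.5]   # three actively replicated NCSs
atomic_broadcast_fit = 0.3         # BFT atomic broadcast layer

# Under the sum-of-failure-rates model, the system-level failure
# rate is bounded by the sum of the component bounds.
system_fit = sum(ncs_fit_bounds) + atomic_broadcast_fit
print(f"system-level FIT bound: {system_fit}")  # 2.8
```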

    When Is CAN the Weakest Link? A Bound on Failures-in-Time in CAN-Based Real-Time Systems

    A method to bound the Failures In Time (FIT) rate of a CAN-based real-time system, i.e., the expected number of failures in one billion operating hours, is proposed. The method leverages an analysis, derived in the paper, of the probability of a correct and timely message transmission despite host and network failures due to electromagnetic interference (EMI). For a given workload, the derived FIT rate can be used to find an optimal replication factor, which is demonstrated with a case study based on a message set taken from a simple mobile robot.
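
    To illustrate how such a FIT bound can drive the choice of a replication factor, here is a deliberately simplified sketch (an assumed independence model with made-up numbers, not the paper's analysis): each replica fails in a given operating hour with probability q, and the system fails only if all replicas fail in the same hour.

```python
def fit_bound(q, replicas):
    """FIT bound (failures per 10^9 hours) under the assumed model:
    the system fails in an hour only if all replicas fail in it."""
    p_system_fail_per_hour = q ** replicas
    return p_system_fail_per_hour * 1e9

q = 1e-4  # assumed per-replica, per-hour failure probability
for r in (1, 2, 3):
    print(f"{r} replica(s): FIT <= {fit_bound(q, r):.3g}")
# 1 replica(s): FIT <= 1e+05
# 2 replica(s): FIT <= 10
# 3 replica(s): FIT <= 0.001
```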

    Multiprocessor real-time scheduling with arbitrary processor affinities: from practice to theory

    Contemporary multiprocessor real-time operating systems, such as VxWorks, LynxOS, QNX, and real-time variants of Linux, allow a process to have an arbitrary processor affinity, that is, a process may be pinned to an arbitrary subset of the processors in the system. Placing such a hard constraint on process migrations can help to improve cache performance of specific multi-threaded applications, achieve isolation among applications, and aid in load-balancing. However, to date, the lack of schedulability analysis for such systems prevents the use of arbitrary processor affinities in predictable hard real-time systems. This paper presents the first analysis of multiprocessor scheduling with arbitrary processor affinities from a real-time perspective. It is shown that job-level fixed-priority scheduling with arbitrary processor affinities is strictly more general than global, clustered, and partitioned job-level fixed-priority scheduling combined. Concerning the more general case of job-level dynamic priorities, it is shown that global and clustered scheduling are equivalent to multiprocessor real-time scheduling with arbitrary processor affinities. The Linux push and pull scheduler is studied as a reference implementation, and two approaches for the schedulability analysis of hard real-time tasks with arbitrary processor affinity masks are presented. In the first approach, the scheduling problem is reduced to “global-like” sub-problems to which existing global schedulability tests can be applied. The second approach is specifically based on response-time analysis and models the response-time computation as a linear optimization problem. The latter linear-programming-based approach has better runtime complexity than the former reduction-based approach. Schedulability experiments show the proposed techniques to be effective.
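
    The affinity mechanism described above is directly scriptable; the following minimal, Linux-only sketch uses Python's wrappers around the same sched_getaffinity/sched_setaffinity system calls to pin the calling process to an arbitrary subset of CPUs.

```python
import os

pid = 0  # 0 denotes the calling process
print("current affinity:", os.sched_getaffinity(pid))

# Pin this process to an arbitrary subset of processors, e.g. CPUs
# 0 and 2 (assumes the machine has at least three CPUs); the kernel
# will then schedule the process only on those CPUs.
os.sched_setaffinity(pid, {0, 2})
print("new affinity:", os.sched_getaffinity(pid))
```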

    Linux's Processor Affinity API, Refined: Shifting Real-Time Tasks Towards Higher Schedulability

    Schedulability Analysis of the Linux Push and Pull Scheduler with Arbitrary Processor Affinities

    Contemporary multiprocessor real-time operating systems, such as VxWorks, LynxOS, QNX, and real-time variants of Linux, allow a process to have an arbitrary processor affinity, that is, a process may be pinned to an arbitrary subset of the processors in the system. Placing such a hard constraint on process migrations can help to improve cache performance of specific multi-threaded applications, achieve isolation among components, and aid in load-balancing. However, to date, the lack of schedulability analysis for such systems prevents the use of arbitrary processor affinities in predictable hard real-time applications. In this paper, it is shown that job-level fixed-priority scheduling with arbitrary processor affinities is strictly more general than global, clustered, and partitioned job-level fixed-priority scheduling. The Linux push and pull scheduler is studied as a reference implementation and techniques for the schedulability analysis of hard real-time tasks with arbitrary processor affinity masks are presented. The proposed tests work by reducing the scheduling problem to “global-like” sub-problems to which existing global schedulability tests can be applied. Schedulability experiments show the proposed techniques to be effective.
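
    As a rough intuition for the “global-like” reduction (a simplification for illustration only; the paper's actual reduction is more involved), tasks whose affinity masks are connected through shared CPUs must be analyzed together, while CPU-disjoint groups can each be handed to an existing global schedulability test:

```python
def global_like_subproblems(affinities):
    """affinities: dict mapping task name -> set of CPU ids.
    Returns a list of (task set, CPU set) sub-problems whose CPU
    sets are pairwise disjoint."""
    groups = []  # list of (set_of_tasks, set_of_cpus)
    for task, cpus in affinities.items():
        merged_tasks, merged_cpus = {task}, set(cpus)
        remaining = []
        for tasks, group_cpus in groups:
            if group_cpus & merged_cpus:   # shares a CPU: merge groups
                merged_tasks |= tasks
                merged_cpus |= group_cpus
            else:
                remaining.append((tasks, group_cpus))
        groups = remaining + [(merged_tasks, merged_cpus)]
    return groups

tasks = {"t1": {0, 1}, "t2": {1}, "t3": {2, 3}, "t4": {3}}
for ts, cpus in global_like_subproblems(tasks):
    print(sorted(ts), "on CPUs", sorted(cpus))
# ['t1', 't2'] on CPUs [0, 1]
# ['t3', 't4'] on CPUs [2, 3]
```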